[AMDGPU][SILoadStoreOptimizer] Fix lds address operand offset#176816
Conversation
|
@llvm/pr-subscribers-backend-amdgpu Author: Ryan Mitchell (RyanRio) ChangesThe offset operand in GLOBAL_LOAD_ASYNC_TO_LDS_B128, for instance, is added to both the lds and global address, but SILoadStoreOptimizer is currently unaware of that. This PR inserts an add to counteract the offset meant for the global address. This one add is better than not doing the optimization at all, and having to insert 2 adds for each global address calculation (with no offset). Full diff: https://github.com/llvm/llvm-project/pull/176816.diff 3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index bd4029ca97014..2d13ee3800731 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1020,6 +1020,14 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
return MI.getDesc().TSFlags & SIInstrFlags::LGKM_CNT;
}
+ static bool usesASYNC_CNT(const MachineInstr &MI) {
+ return MI.getDesc().TSFlags & SIInstrFlags::ASYNC_CNT;
+ }
+
+ bool usesASYNC_CNT(uint16_t Opcode) const {
+ return get(Opcode).TSFlags & SIInstrFlags::ASYNC_CNT;
+ }
+
// Most sopk treat the immediate as a signed 16-bit, however some
// use it as unsigned.
static bool sopkIsZext(unsigned Opcode) {
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index dc1592e5fb82a..f50333b5886a7 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -279,6 +279,7 @@ class SILoadStoreOptimizer {
void updateBaseAndOffset(MachineInstr &I, Register NewBase,
int32_t NewOffset) const;
+ void updateAsyncLDSAddress(MachineInstr &MI, int32_t OffsetDiff) const;
Register computeBase(MachineInstr &MI, const MemAddress &Addr) const;
MachineOperand createRegOrImm(int32_t Val, MachineInstr &MI) const;
std::optional<int32_t> extractConstOffset(const MachineOperand &Op) const;
@@ -2311,6 +2312,27 @@ void SILoadStoreOptimizer::processBaseWithConstOffset(const MachineOperand &Base
Addr.Offset = (*Offset0P & 0x00000000ffffffff) | (Offset1 << 32);
}
+// Maintain the correct LDS address for async loads.
+// It becomes incorrect when promoteConstantOffsetToImm
+// adds an offset only meant for the src operand.
+void SILoadStoreOptimizer::updateAsyncLDSAddress(MachineInstr &MI,
+ int32_t OffsetDiff) const {
+ if (!TII->usesASYNC_CNT(MI) || OffsetDiff == 0)
+ return;
+
+ Register OldVDst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst)->getReg();
+ Register NewVDst =
+ MRI->createVirtualRegister(TRI->getRegClassForReg(*MRI, OldVDst));
+ MachineBasicBlock &MBB = *MI.getParent();
+ DebugLoc DL = MI.getDebugLoc();
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_ADD_U32_e64), NewVDst)
+ .addReg(OldVDst)
+ .addImm(-OffsetDiff)
+ .addImm(0);
+
+ MI.getOperand(0).setReg(NewVDst);
+}
+
bool SILoadStoreOptimizer::promoteConstantOffsetToImm(
MachineInstr &MI,
MemInfoMap &Visited,
@@ -2440,7 +2462,9 @@ bool SILoadStoreOptimizer::promoteConstantOffsetToImm(
// Instead of moving up, just re-compute anchor-instruction's base address.
Register Base = computeBase(MI, AnchorAddr);
- updateBaseAndOffset(MI, Base, MAddr.Offset - AnchorAddr.Offset);
+ int32_t OffsetDiff = MAddr.Offset - AnchorAddr.Offset;
+ updateBaseAndOffset(MI, Base, OffsetDiff);
+ updateAsyncLDSAddress(MI, OffsetDiff);
LLVM_DEBUG(dbgs() << " After promotion: "; MI.dump(););
for (auto [OtherMI, OtherOffset] : InstsWCommonBase) {
@@ -2451,7 +2475,9 @@ bool SILoadStoreOptimizer::promoteConstantOffsetToImm(
if (TLI->isLegalFlatAddressingMode(AM, AS)) {
LLVM_DEBUG(dbgs() << " Promote Offset(" << OtherOffset; dbgs() << ")";
OtherMI->dump());
- updateBaseAndOffset(*OtherMI, Base, OtherOffset - AnchorAddr.Offset);
+ int32_t OtherOffsetDiff = OtherOffset - AnchorAddr.Offset;
+ updateBaseAndOffset(*OtherMI, Base, OtherOffsetDiff);
+ updateAsyncLDSAddress(*OtherMI, OtherOffsetDiff);
LLVM_DEBUG(dbgs() << " After promotion: "; OtherMI->dump());
}
}
diff --git a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm-gfx12.mir b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm-gfx12.mir
new file mode 100644
index 0000000000000..3f243f39fa585
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm-gfx12.mir
@@ -0,0 +1,74 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1250 -verify-machineinstrs -start-before=si-load-store-opt -stop-before=si-insert-waitcnts --verify-machineinstrs -o - %s | FileCheck -check-prefixes=GFX1250 %s
+
+---
+name: promote_async_load_offset
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+ frameOffsetReg: '$sgpr33'
+tracksRegLiveness: true
+body: |
+ bb.0.entry:
+ liveins: $vgpr0, $sgpr0_sgpr1, $ttmp7
+ ; GFX1250-LABEL: name: promote_async_load_offset
+ ; GFX1250: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1
+ ; GFX1250-NEXT: {{ $}}
+ ; GFX1250-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 $vgpr0, 512, 0, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr3, dead $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr0 = V_ADD_U32_e32 256, $vgpr1, implicit $exec
+ ; GFX1250-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr0, $vgpr2_vgpr3, -256, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ ; GFX1250-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ %0:vgpr_32(s32) = COPY $vgpr0
+ %5:vgpr_32 = V_LSHLREV_B32_e64 8, %0:vgpr_32(s32), implicit $exec
+ %7:vgpr_32 = disjoint V_OR_B32_e64 %5:vgpr_32, 0, implicit $exec
+ %8:vgpr_32 = disjoint V_OR_B32_e64 %5:vgpr_32, 0, implicit $exec
+ %32:vgpr_32, %34:sreg_32_xm0_xexec = V_ADD_CO_U32_e64 %0:vgpr_32, 256, 0, implicit $exec
+ %33:vgpr_32, %35:sreg_32_xm0_xexec = V_ADDC_U32_e64 %0:vgpr_32, 0, killed %34:sreg_32_xm0_xexec, 0, implicit $exec
+ %17:vreg_64_align2 = REG_SEQUENCE %32:vgpr_32, %subreg.sub0, %33:vgpr_32, %subreg.sub1
+ GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed %7:vgpr_32, killed %17:vreg_64_align2, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ %40:vgpr_32, %36:sreg_32_xm0_xexec = V_ADD_CO_U32_e64 %0:vgpr_32, 512, 0, implicit $exec
+ %41:vgpr_32, %37:sreg_32_xm0_xexec = V_ADDC_U32_e64 %0:vgpr_32, 0, killed %36:sreg_32_xm0_xexec, 0, implicit $exec
+ %21:vreg_64_align2 = REG_SEQUENCE %40:vgpr_32, %subreg.sub0, %41:vgpr_32, %subreg.sub1
+ GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed %8:vgpr_32, killed %21:vreg_64_align2, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+...
+
+---
+name: promote_async_load_offset_two_offsets
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+ frameOffsetReg: '$sgpr33'
+tracksRegLiveness: true
+body: |
+ bb.0.entry:
+ liveins: $vgpr0, $sgpr0_sgpr1, $ttmp7
+ ; GFX1250-LABEL: name: promote_async_load_offset_two_offsets
+ ; GFX1250: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1
+ ; GFX1250-NEXT: {{ $}}
+ ; GFX1250-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 $vgpr0, 516, 0, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr3, dead $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec
+ ; GFX1250-NEXT: renamable $vgpr0 = V_ADD_U32_e32 260, $vgpr1, implicit $exec
+ ; GFX1250-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr0, $vgpr2_vgpr3, -260, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ ; GFX1250-NEXT: renamable $vgpr0 = V_ADD_U32_e32 4, $vgpr1, implicit $exec
+ ; GFX1250-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr0, $vgpr2_vgpr3, -4, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ ; GFX1250-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ %0:vgpr_32(s32) = COPY $vgpr0
+ %5:vgpr_32 = V_LSHLREV_B32_e64 8, %0:vgpr_32(s32), implicit $exec
+ %7:vgpr_32 = disjoint V_OR_B32_e64 %5:vgpr_32, 0, implicit $exec
+ %8:vgpr_32 = disjoint V_OR_B32_e64 %5:vgpr_32, 0, implicit $exec
+ %32:vgpr_32, %34:sreg_32_xm0_xexec = V_ADD_CO_U32_e64 %0:vgpr_32, 256, 0, implicit $exec
+ %33:vgpr_32, %35:sreg_32_xm0_xexec = V_ADDC_U32_e64 %0:vgpr_32, 0, killed %34:sreg_32_xm0_xexec, 0, implicit $exec
+ %17:vreg_64_align2 = REG_SEQUENCE %32:vgpr_32, %subreg.sub0, %33:vgpr_32, %subreg.sub1
+ GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed %7:vgpr_32, killed %17:vreg_64_align2, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ %40:vgpr_32, %36:sreg_32_xm0_xexec = V_ADD_CO_U32_e64 %0:vgpr_32, 512, 0, implicit $exec
+ %41:vgpr_32, %37:sreg_32_xm0_xexec = V_ADDC_U32_e64 %0:vgpr_32, 0, killed %36:sreg_32_xm0_xexec, 0, implicit $exec
+ %21:vreg_64_align2 = REG_SEQUENCE %40:vgpr_32, %subreg.sub0, %41:vgpr_32, %subreg.sub1
+ GLOBAL_LOAD_ASYNC_TO_LDS_B128 %8:vgpr_32, killed %21:vreg_64_align2, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+ %42:vgpr_32, %38:sreg_32_xm0_xexec = V_ADD_CO_U32_e64 %0:vgpr_32, 516, 0, implicit $exec
+ %43:vgpr_32, %39:sreg_32_xm0_xexec = V_ADDC_U32_e64 %0:vgpr_32, 0, killed %38:sreg_32_xm0_xexec, 0, implicit $exec
+ %22:vreg_64_align2 = REG_SEQUENCE %42:vgpr_32, %subreg.sub0, %43:vgpr_32, %subreg.sub1
+ GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed %8:vgpr_32, killed %22:vreg_64_align2, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
+...
|
arsenm
left a comment
There was a problem hiding this comment.
Can you also include an end to end ir test
d8ebc76 to
e418da0
Compare
| return std::nullopt; | ||
|
|
||
| // Handle S_MOV_B64_IMM_PSEUDO for 64-bit immediates | ||
| if (Def->getOpcode() == AMDGPU::S_MOV_B64_IMM_PSEUDO) |
There was a problem hiding this comment.
Use getConstValDefinedInReg? Or extract SIFoldOperands::getImmOrMaterializedImm into SIInstrInfo
There was a problem hiding this comment.
I chose to extract getImmOrMaterializedImm, maybe there's also room for improvement to clean up that getConstValDefinedInReg, but that's probably outside the scope of this PR
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed. |
|
@arsenm Please merge on clean CI |
|
|
||
| return Def->getOperand(1).getImm(); | ||
| // Now extract the base register (which should be a 64-bit VGPR). | ||
| MachineInstr *BaseDef = MRI->getVRegDef(BaseOp->getReg()); |
There was a problem hiding this comment.
BaseDef is unused when building without asserts.
…toreOptimizer.cpp Fixes llvm#176816 (13b20e7).
… async stores (#180220) `updateAsyncLDSAddress`, introduces by #176816, previously only handled async loads , where the LDS address is in the `vdst` operand. Therefore Async stores produced a nullptr dereference since the LDS address is in `vdata` for those instructions. --------- Co-authored-by: Jay Foad <jay.foad@gmail.com>
…etToImm for async stores (#180220) `updateAsyncLDSAddress`, introduces by llvm/llvm-project#176816, previously only handled async loads , where the LDS address is in the `vdst` operand. Therefore Async stores produced a nullptr dereference since the LDS address is in `vdata` for those instructions. --------- Co-authored-by: Jay Foad <jay.foad@gmail.com>
The offset operand in GLOBAL_LOAD_ASYNC_TO_LDS_B128, for instance, is added to both the lds and global address, but SILoadStoreOptimizer is currently unaware of that. This PR inserts an add to counteract the offset meant for the global address. This one add is better than not doing the optimization at all, and having to insert 2 adds for each global address calculation (with no offset).
This PR also promotes the global address to an offset when the offset is calculated with V_ADD_U64 on applicable gfx versions, (and inversely adds the LDS offset), whereas previously the optimization opportunity was missed entirely.